1. Introduction

Here I will analyze the data that is produced in the paper: “Principles of dengue virus evolvability derived from genotype-fitness maps in human and mosquito cells” by Dolan, et al. The paper can be found here.

I have defined some helper functions in the file: helper_functions.R. A few of these functions will help us load the optimality and mutation data that is used in the analysis. The RSCU fold change data was calculated using the Dengue 2 strain 16681, which is the strain used in this paper.

Let’s load the data.

# load data
all_mutation_data = load_condensed_mutation_data()
synonymous_mutation_data = load_condensed_synonymous_mutation_data()
overall_syn_mrf = load_syn_mrf_per_species()

2. Visualization

The Mean Relative Fitness (MRF) is calculated as the average of the relative fitnesses for a particular mutation or set of mutations.

First, we will look for correlation between the MRF of synonymous mutations to each codon and codon optimality. We will only focus on synonymous mutations to avoid the influence of changing the peptide on the relative fitness.

# define plotting variables 
plot_vars = list(c('Human', 'A'), c('Human', 'B'), c('Mosquito', 'A'), c('Mosquito', 'B'))

# initialize the PDF 
pdf('./figures/syn_mut_all_against_csc.pdf')

# for loop to go through each set of plot_vars and create plot
num = 1
for (i in plot_vars) {
  assign(paste0('temp_', num), plot_mrf_all_dot(synonymous_mutation_data, i, 'mutcodon_csc', 'CSC'))
  num = num + 1
}
dev.off()
## png 
##   2

grid.arrange(temp_1, temp_2, temp_3, temp_4, nrow = 2, ncol = 2, widths = c(4, 4))

Great! There doesn’t seem to be much correlation between the Mean Relative Fitness and the codon optimality for either of the species overall.

Let’s check to see if there is a correlation between these two variables within each amino acid.

# initialize the PDF 
pdf('./figures/syn_mut_per_aa_against_csc.pdf')

# plot 
garbage = sapply(levels(factor(synonymous_mutation_data$mut_aa)), function(aa, ...) plot_mrf_per_aa_dot(synonymous_mutation_data, aa, 'mutcodon_csc', 'CSC'))

dev.off()
## png 
##   2

arg_vs_csc = plot_mrf_per_aa_dot(synonymous_mutation_data, 'Arg', 'mutcodon_csc', 'CSC')

Using Arginine as an example, we can see that mutations towards optimal codons tend have a lower Mean Relative Fitness than mutations towards non-optimal codons. This is true in both species.

Now we can check the correlations within each amino acid with a dot plot between the two replicates in each species.

# get the correlations against the csc for each aa and each replicate 
csc_correlations = synonymous_mutation_data %>%
  group_by(host, set, mut_aa) %>%
  summarize(
    correlation = cor(mutcodon_csc, mean_wrel, method = 'spearman'), 
    range = max(mutcodon_csc) - min(mutcodon_csc), 
    n_codons = n()
  )

# split the set variable into two columns, A and B
csc_correlations = spread(csc_correlations, set, correlation)

# define plot variables 
plot_vars = c('Human', 'Mosquito')

# initialize the PDF 
pdf('./figures/rep_corr_all_against_csc.pdf')

# for loop for plotting
num = 1
for (h in plot_vars) {
  assign(paste0('temp_', num), plot_corr_all(csc_correlations, h, 'AA CSC Range'))
  num = num + 1
}
dev.off()
## png 
##   2

temp_1

temp_2

The majority of the amino acids are located in the lower left quadrant for human. The amino acids for mosquito are more spread our but there are a good number in the lower left quadrant. It appears that the correlation between the Mean Relative Fitness and codon optimality is not extremely strong.

Now, let’s check the correlations of the Mean Relative Fitness vs the RSCU fold change of the codons. We will do this overall and for the individual amino acids.

# define plotting variables
plot_vars = list(c('Human', 'A'), c('Human', 'B'), c('Mosquito', 'A'), c('Mosquito', 'B'))

# initialize the PDF 
pdf('./figures/syn_mut_all_against_rscu_fc.pdf') 

# for loop to go through each set of plot_vars and create plot
num = 1
for (i in plot_vars) {
  assign(paste0('temp_', num), plot_mrf_all_dot(synonymous_mutation_data, i, 'mutcodon_rscu_fc', 'log2 RSCU FC'))
  num = num + 1
}
dev.off()
## png 
##   2

grid.arrange(temp_1, temp_2, temp_3, temp_4, nrow = 2, ncol = 2, widths = c(4, 4))

There is a strong correlation between the Mean Relative Fitness and the log2 RSCU fold change of the codon that is mutated to for humans. This indicates that mutations by Dengue towards codons with a high RSCU fold change tend to be more beneficial for the virus than mutations towards codons with a low RSCU fold change.

Although the correlation is not as strong in mosquito, the upward trend can still be seen. Importantly, mutations towards AGA (the codon with the highest RSCU fold change in both species) tend to be beneficial as well.

# initialize the PDF 
pdf('./figures/syn_mut_per_aa_against_rscu_fc.pdf')

# plot 
garbage = sapply(levels(factor(synonymous_mutation_data$mut_aa)), function(aa, ...) plot_mrf_per_aa_dot(synonymous_mutation_data, aa, 'mutcodon_rscu_fc', 'log2 RSCU FC'))

dev.off()
## png 
##   2

# plot arginine as an example 
arg_vs_rscu_fc = plot_mrf_per_aa_dot(synonymous_mutation_data, 'Arg', 'mutcodon_rscu_fc', 'log2 RSCU FC')

Using Arginine as an example again, we see that both the human and mosquito have positive correlations. Mutations towards AGA tend to be beneficial in both of the species.

Now let’s get an overview of the correlations for each amino acid.

# get the correlations against the rscu_fc for each aa and each replicate
rscu_fc_correlations = synonymous_mutation_data %>%
  group_by(host, set, mut_aa) %>%
  summarize(
    correlation = cor(mutcodon_rscu_fc, mean_wrel, method = 'spearman'), 
    range = max(mutcodon_rscu_fc) - min(mutcodon_rscu_fc), 
    n_codons = n()
  )

# split the set variable into two columns, A and B
rscu_fc_correlations = spread(rscu_fc_correlations, set, correlation)

# define plot variables 
plot_vars = c('Human', 'Mosquito')

# initialize the PDF 
pdf('./figures/rep_corr_all_against_rscu_fc.pdf')

# for loop for plotting 
num = 1
for (h in plot_vars) {
  assign(paste0('temp_', num), plot_corr_all(rscu_fc_correlations, h, 'AA RSCU FC Range'))
  num = num + 1
}
dev.off()
## png 
##   2

temp_1

temp_2

We can see that the majority of amino acids have a positive correlation between Mean Relative Fitness and the RSCU FC of the codon mutated to. This holds true for both replicates and both species. Some of the correlations we’ve seen have not been extremely strong, but we can see a trend in which synonymous mutations towards codons that have a high RSCU FC (i.e. codons preferred by Dengue relative to the host) tend to be more beneficial than mutations towards codons that have a low RSCU FC (i.e. codons not preferred by Dengue relative to the host).

Now let’s plot the Mean Relative Fitness of synonymous mutations as a barplot with the delta CSC as fill. We call the delta CSC the change in optimality that occurs when one codon is mutated to another (i.e. red means the mutatation was from a lower optimality codon to a higher optimality codon; blue means the opposite).

# select only the synonymous mutations 
all_mutation_data_synonymous = all_mutation_data %>%
  filter(muttype == 'Syn')

# initialize the PDF 
pdf('./figures/syn_mut_to_aa_codons_with_csc_as_fill.pdf', width = 14, height = 8)

# plot 
garbage = sapply(levels(factor(all_mutation_data_synonymous$mut_aa)), function(aa, ...) plot_mrf_per_aa_bar(all_mutation_data_synonymous, aa, 'delta_csc', 'Delta CSC'))

dev.off()
## png 
##   2

# plot arginine as an example 
arg_w_csc_fill = plot_mrf_per_aa_bar(all_mutation_data_synonymous, 'Arg', 'delta_csc', 'Delta CSC')

As we’ve seen before, there appears to be a weak correlation between the Mean Relative Fitness and optimality. Mutations towards codons with a lower optimality tend to have relatively high fitness, while mutations towards codons with a higher optimality have lower fitness. However, this is not consistent.

As expected, mutations towards AGA tend to be beneficial for the virus. It appears that this trend is better explained by the codons’ RSCU FC than its optimality. Let’s see if that is true.

# initialize the PDF 
pdf('./figures/syn_mut_to_aa_codons_with_rscu_fc_as_fill.pdf', width = 14, height = 8)

# plot 
garbage = sapply(levels(factor(all_mutation_data_synonymous$mut_aa)), function(aa, ...) plot_mrf_per_aa_bar(all_mutation_data_synonymous, aa, 'delta_rscu_fc', 'Delta RSCU FC'))

dev.off()
## png 
##   2

# plot arginine as an example 
arg_w_rscu_fc_fill = plot_mrf_per_aa_bar(all_mutation_data_synonymous, 'Arg', 'delta_rscu_fc', 'Delta RSCU FC')

Interesting! We can see that mutations towards codons that have a higher RSCU FC (green bars) tend to have higher fitness than mutations towards codons that have a lower RSCU FC (purple bars). Although this correlation is not perfect, the RSCU FC hypothesis seems to support the data more than the CSC hypothesis. We know from previous analysis that the codons with the highest RSCU FC tend to be non-optimal, which is likely why we see a weak correlation between Mean Relative Fitness and codon optimality.

Let’s try one more way of visualizing the Mean Relative Fitness.

# load the data
synonymous_mutation_data_fitness_class_perc = load_fitness_class_percentages_data()

# initialize the PDF 
pdf('./figures/syn_mut_to_aa_codons_with_fitness_class_as_fill.pdf', width = 14, height = 8)

# plot
garbage = sapply(levels(factor(synonymous_mutation_data_fitness_class_perc$mut_aa)), function(aa, ...) plot_status_per_aa_bar(synonymous_mutation_data_fitness_class_perc, aa))

dev.off()
## png 
##   2

# plot arginine as an example 
arg_with_status_fill = plot_status_per_aa_bar(synonymous_mutation_data_fitness_class_perc, 'Arg')

Again, we can see that mutations towards AGA are beneficial more often than mutations towards other codons. Mutations towards AGA in both humans and mosquitos are rarely deleterious and are commonly neutral. Mutations towards other codons are far more likely to be deleterious or even lethal. Interestingly, mutations towards CGC from CGT also can be beneficial in both species, although the frequency of this mutation is much lower.

We can see a preference to mutate towards codons that have a higher RSCU FC. Let’s check if we can see a preference towards the groups of codons that we defined previously (i.e. Denguenized/Non-denguenized). First, we will define the groups of codons.

# load the Dengue strain sequence 
dengue_strain = read.fasta('../4-AnalyzeSingleCellSeq/data/fastas/dengue_2_16681.fa', as.string = TRUE)

# make DNAString object 
dengue_strain_seq = DNAString(as.character(dengue_strain))

# calculate codon composition and order
dengue_strain_codons = trinucleotideFrequency(dengue_strain_seq, step = 3, as.prob = TRUE)
dengue_strain_codons = dengue_strain_codons[!(names(dengue_strain_codons) %in% c('TAA', 'TAG', 'TGA'))]
dengue_strain_codons_ordered = dengue_strain_codons[order(-dengue_strain_codons)]

# load rscu_fc data 
rscu_fc_data = load_rscu_fc()
rscu_fc_data$human_rscu_fc = round(rscu_fc_data$human_rscu_fc, 2)
rscu_fc_data$mosquito_rscu_fc = round(rscu_fc_data$mosquito_rscu_fc, 2)

# define groups 
frequently_used = dengue_strain_codons_ordered[1:16]
infrequently_used = dengue_strain_codons_ordered[46:61]

preferentially_used_human = rscu_fc_data[rscu_fc_data$human_rscu_fc >= 0.5, ]$codon
non_preferentially_used_human = rscu_fc_data[rscu_fc_data$human_rscu_fc <= -0.5, ]$codon
preferentially_used_mosquito = rscu_fc_data[rscu_fc_data$mosquito_rscu_fc >= 0.5, ]$codon
non_preferentially_used_mosquito = rscu_fc_data[rscu_fc_data$mosquito_rscu_fc <= -0.5, ]$codon

denguenized_human = intersect(names(frequently_used), preferentially_used_human)
non_denguenized_human = intersect(names(infrequently_used), non_preferentially_used_human)
denguenized_mosquito = intersect(names(frequently_used), preferentially_used_mosquito)
non_denguenized_mosquito = intersect(names(infrequently_used), non_preferentially_used_mosquito)

# make data.frames for saving
max_length_human = max(c(length(frequently_used), length(infrequently_used), length(preferentially_used_human), length(non_preferentially_used_human)))
human_codon_groups = data.frame(
  frequently_used = c(names(frequently_used), rep(NA, max_length_human - length(frequently_used))), 
  infrequently_used = c(names(infrequently_used), rep(NA, max_length_human - length(infrequently_used))), 
  preferentially_used = c(preferentially_used_human, rep(NA, max_length_human - length(preferentially_used_human))), 
  non_preferentially_used = c(non_preferentially_used_human, rep(NA, max_length_human - length(non_preferentially_used_human))), 
  denguenized = c(denguenized_human, rep(NA, max_length_human - length(denguenized_human))), 
  non_denguenized = c(non_denguenized_human, rep(NA, max_length_human - length(non_denguenized_human)))
)

max_length_mosquito = max(c(length(frequently_used), length(infrequently_used), length(preferentially_used_mosquito), length(non_preferentially_used_mosquito)))
mosquito_codon_groups = data.frame(
  frequently_used = c(names(frequently_used), rep(NA, max_length_mosquito - length(frequently_used))), 
  infrequently_used = c(names(infrequently_used), rep(NA, max_length_mosquito - length(infrequently_used))), 
  preferentially_used = c(preferentially_used_mosquito, rep(NA, max_length_mosquito - length(preferentially_used_mosquito))), 
  non_preferentially_used = c(non_preferentially_used_mosquito, rep(NA, max_length_mosquito - length(non_preferentially_used_mosquito))), 
  denguenized = c(denguenized_mosquito, rep(NA, max_length_mosquito - length(denguenized_mosquito))), 
  non_denguenized = c(non_denguenized_mosquito, rep(NA, max_length_mosquito - length(non_denguenized_mosquito)))
)

# save 
sheets = list('Human' = human_codon_groups, 'Mosquito' = mosquito_codon_groups)
write.xlsx(sheets, './data/codon_groups.xlsx')

Now, let’s plot boxplots of the Mean Relative Fitness of the synonymous mutations to the codons in each of these groups.

# add host column to the codon groups data.frame 
human_codon_groups$host = 'Human'
mosquito_codon_groups$host = 'Mosquito'

# combine the two data.frames into one 
codon_groups = rbind(human_codon_groups, mosquito_codon_groups)

# melt data.frame
melted_codon_groups = reshape2::melt(codon_groups, id = 'host')

# add a group column to link the pairs of codon groups 
melted_codon_groups$group = ifelse(melted_codon_groups$variable %in% c('frequently_used', 'infrequently_used'), 1, 
                                   ifelse(melted_codon_groups$variable %in% c('preferentially_used', 'non_preferentially_used'), 2, 3))

# rename columns 
names(melted_codon_groups) = c('host', 'codon_class', 'mutcodon', 'group')

# merge the data.frame with the mean relative fitness values 
melted_codon_groups = merge(melted_codon_groups, overall_syn_mrf, by = c('host', 'mutcodon'))

# calculate the p-values between the codon group pairs 
sig = melted_codon_groups %>%
  group_by(host, group) %>%
  wilcox_test(mean_wrel ~ codon_class, paired = FALSE) %>%
  add_significance() %>%
  add_xy_position(fun = 'median_iqr', x = 'group')
sig$y.position = 1.45

# define function to the get the sample size of the boxplots 
give.n = function(x) {
  return(c(y = 1.7, label = length(x)))
}

# plot 
syn_mut_toward_codon_groups = melted_codon_groups %>%
  ggplot(aes(x = group, y = mean_wrel, fill = codon_class, group = codon_class)) +
  geom_boxplot(outlier.shape = NA) +
  scale_fill_manual('Codon Group', values = c('frequently_used' = '#DD8505', 'infrequently_used' = '#B5025C', 
                                              'preferentially_used' = '#008A46', 'non_preferentially_used' = '#8A668A', 
                                              'denguenized' = '#CF3CD3', 'non_denguenized' = '#24C9C9'), 
                    labels = c('Frequently Used', 'Infrequently Used', 'Preferentially Used', 'Unpreferentially Used', 
                               'Denguenized', 'Non-denguenized')) +
  geom_hline(yintercept = 1, linetype = 'dashed', color = 'grey') +
  stat_pvalue_manual(sig, inherit.aes = FALSE, remove.bracket = TRUE) +
  stat_summary(fun.data = give.n, geom = "text", position = position_dodge(.75)) +
  facet_grid(host ~ .) +
  theme_classic() +
  theme(axis.text.x = element_blank(), 
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(), 
        panel.border = element_rect(size = 2, color = 'black', fill = NA), 
        strip.background = element_rect(size = 2, color = 'black', fill = NA)) +
  coord_cartesian(ylim = c(0.4, 1.7)) +
  labs(
    y = 'Mean Relative Fitness', 
    title = 'Synonymous Mutations'
  )

# save 
ggsave('./figures/syn_mut_toward_codon_groups.pdf', syn_mut_toward_codon_groups, height = 14, width = 8)

# save data.frame
write.table(
  melted_codon_groups, 
  './data/syn_mrf_codon_groups_p9.csv', 
  sep = ',', col.names = TRUE, row.names = FALSE, quote = FALSE
)

# show 
syn_mut_toward_codon_groups

Wow! We can see that synonymous mutations towards frequently used codons are more beneficial than synonymous mutations towards infrequently used codons, synonymous mutations towards preferentially used codons are more beneficial than synonymous mutations towards unpreferentially used codons, and synonymous mutations towards Denguenized codons are more beneficial than synonymous mutations towards Non-denguenized codons. This holds true for both species even though the codon groups vary between the hosts.

Lastly, as an overview, let’s see how the Mean Relative Fitness correlates between humans and mosquitos.

# split host column into two columns
overall_syn_mrf_spread = spread(overall_syn_mrf, host, mean_wrel)

# plot 
mrf_corr_human_mosquito = overall_syn_mrf_spread %>%
  ggplot(aes(x = Human, y = Mosquito)) + 
  geom_point() +
  geom_text_repel(aes(label = mutcodon)) +
  geom_hline(yintercept = 1, linetype = 'dashed', color = 'grey') +
  geom_vline(xintercept = 1, linetype = 'dashed', color = 'grey') +
  stat_cor(method = 'spearman') +
  theme_classic() +
  labs(
    x = 'Human Mean Relative Fitness', 
    y = 'Mosquito Mean Relative Fitness'
  )

# save 
pdf('./figures/mrf_corr_human_mosquito.pdf')
print(mrf_corr_human_mosquito)
dev.off()
## png 
##   2

mrf_corr_human_mosquito

There is a strong correlation between the codons that are beneficial to synonymously mutate to in humans and mosquitos. This is to be expected since Dengue can infect both hosts. There are some differences, however, which can give insight into how Dengue adapts differently to each host. Our most important codon, AGA, is beneficial to mutate to in both of the hosts.

Overall, synonymous mutations towards codons that have a higher RSCU FC (which are often non-optimal) tend to be more beneficial for Dengue than mutations towards codons that have a lower RSCU FC. This is most evident in synonymous mutations towards AGA. It appears that Dengue prefers to maintain the differences in the host’s codon usage and its codon usage during adaptation. We believe this might have something to do with manipulating the tRNA pool to make certain codons (often non-optimal) more optimal.